A Survey on Advanced Fraud Detection: Leveraging K-SMOTEENN and Stacking Ensemble to Tackle Data Imbalance and Extract Insights

Authors: Vyshnavi M, Chaithra M, Dhanya Shetty, Nireeksha Pai B N

DOI Link: https://doi.org/10.22214/ijraset.2025.74246

Abstract

Credit card fraud detection is a major challenge for financial security because the datasets are highly imbalanced and fraudulent activity is becoming increasingly sophisticated. To address these problems, a new detection framework is proposed that combines k-means SMOTEENN with a stacking ensemble of machine learning (ML) algorithms. The dataset, which consists of anonymised transaction records with severe class imbalance, is preprocessed and balanced using k-means SMOTEENN. This method oversamples the minority class and then removes noisy synthetic samples, which lowers the risk of overfitting. A stacking ensemble is built on the balanced dataset by combining diverse base learners such as XGBoost and Decision Tree, and a meta-learner then combines their outputs into a robust decision boundary. The experimental findings show that this model performs far better than standalone classifiers, with an F1-score of 0.92, a precision of 0.95, a recall of 0.88, an area under the precision-recall curve (AUPRC) of 0.96, and a perfect ROC-AUC of 1.00. To improve trustworthiness further, Explainable AI (XAI) methods such as Local Interpretable Model-Agnostic Explanations (LIME) are used to show how important features affect the model's predictions and to explain the roles of the base learners and the meta-learner. The framework therefore offers both high detection accuracy and interpretability, supporting reliable and transparent fraud detection in real-world financial settings.

Introduction

The rapid increase in financial transactions, driven by banks and online shopping, has boosted digital payment systems but also escalated credit card fraud, posing significant economic and security challenges. Credit card fraud losses have risen dramatically over recent years, highlighting the need for advanced detection methods.

Traditional fraud detection techniques based on manual rules and simple statistical models are inadequate against increasingly sophisticated fraud tactics. Machine learning (ML) and deep learning (DL) frameworks, which can identify complex patterns, are therefore essential. A major challenge in ML-based fraud detection is class imbalance: fraudulent transactions are far fewer than legitimate ones, so models become biased toward the majority class and miss fraud. Techniques like K-SMOTEENN, which oversample minority fraud cases and then remove noisy samples, help balance datasets and improve detection accuracy, especially when combined with ensemble learning methods like stacking.
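The resampling idea described above can be illustrated with a small sketch. It is a simplified stand-in for K-SMOTEENN, not the authors' implementation: the toy data, the function names `kmeans_smote` and `edited_nearest_neighbours`, and all parameter values are assumptions made for illustration. Real k-means SMOTE also filters clusters by imbalance ratio and distributes samples by cluster sparsity, which this sketch omits.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Toy data (assumption): 200 legitimate points and 10 fraud points in 2-D.
X_major = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X_minor = rng.normal(loc=3.0, scale=0.5, size=(10, 2))
X = np.vstack([X_major, X_minor])
y = np.array([0] * 200 + [1] * 10)


def kmeans_smote(X, y, n_clusters=4, n_new=100, rng=rng):
    """Add n_new synthetic minority points, interpolated inside k-means clusters."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    # Keep only clusters with at least two minority points,
    # because interpolation needs a pair of points.
    pools = []
    for c in range(n_clusters):
        pts = X[(labels == c) & (y == 1)]
        if len(pts) >= 2:
            pools.append(pts)
    if not pools:
        return X, y  # nothing to interpolate from
    synthetic = []
    for i in range(n_new):
        pts = pools[i % len(pools)]  # cycle through the usable clusters
        a, b = pts[rng.choice(len(pts), size=2, replace=False)]
        synthetic.append(a + rng.random() * (b - a))  # SMOTE-style interpolation
    X_new = np.vstack([X, np.array(synthetic)])
    y_new = np.concatenate([y, np.ones(n_new, dtype=int)])
    return X_new, y_new


def edited_nearest_neighbours(X, y, k=3):
    """Drop every sample whose label disagrees with the majority of its k neighbours."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)  # column 0 is the point itself (no exact duplicates assumed)
    neighbour_labels = y[idx[:, 1:]]
    majority_vote = (neighbour_labels.mean(axis=1) > 0.5).astype(int)
    keep = majority_vote == y
    return X[keep], y[keep]


X_over, y_over = kmeans_smote(X, y)
X_bal, y_bal = edited_nearest_neighbours(X_over, y_over)
print("class counts before:", np.bincount(y))
print("class counts after: ", np.bincount(y_bal))
```

In practice, resampling of this kind is normally applied to the training split only, so that the test set keeps the original class distribution.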

The literature review reveals a trend towards hybrid ML models and ensemble classifiers that address data imbalance, noise, and scalability. Random forests, ensemble voting methods, and specialized sampling techniques (e.g., noisy sample-removed undersampling (NUS) and adaptively weighted oversampling) have been proposed to enhance performance. Because accuracy is misleading on skewed data, these approaches prioritize precision and recall instead.

The proposed system uses a large synthetic credit card dataset (PaySim), cleans it and engineers features during preprocessing, and balances it with K-SMOTEENN. Multiple classifiers (XGBoost, Decision Tree, Random Forest, Logistic Regression) are trained as base learners and combined in a stacking ensemble for robust fraud detection. Explainable AI methods such as LIME are employed to interpret model predictions.
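A stacking ensemble of this kind can be sketched with scikit-learn's `StackingClassifier`. The following is a minimal illustration under stated assumptions, not the authors' pipeline: the data is synthetic, `GradientBoostingClassifier` stands in for XGBoost, the choice of logistic regression as meta-learner is an assumption (the text does not name the meta-learner), and the resampling step is left out for brevity.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): roughly 5% positive ("fraud") rows.
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Base learners mirroring the list in the text.
base_learners = [
    ("gb", GradientBoostingClassifier(random_state=0)),  # stand-in for XGBoost
    ("dt", DecisionTreeClassifier(max_depth=5, random_state=0)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("lr", LogisticRegression(max_iter=1000)),
]

# The meta-learner is trained on out-of-fold probabilities from the base learners.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),  # assumed meta-learner
    cv=5,
    stack_method="predict_proba",
)
stack.fit(X_train, y_train)

proba = stack.predict_proba(X_test)[:, 1]
pred = stack.predict(X_test)
print("F1:   ", round(f1_score(y_test, pred), 3))
print("AUPRC:", round(average_precision_score(y_test, proba), 3))
```

Training the meta-learner on cross-validated predictions (`cv=5`) rather than on in-sample predictions keeps it from simply trusting whichever base learner overfits the training data most.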

Performance metrics (accuracy, precision, recall, F1-score, AUC-ROC) show that the stacking ensemble outperforms individual classifiers, delivering high fraud detection accuracy on highly imbalanced data.
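To see why precision, recall, and F1 are more informative than accuracy on skewed data, consider a short worked example. The confusion-matrix counts below are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 10,000 transactions, of which 100 are fraudulent.
tp, fp, fn, tn = 80, 10, 20, 9890
total = tp + fp + fn + tn

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
accuracy = (tp + tn) / total

# A trivial model that labels every transaction as legitimate
# gets all 9,900 legitimate rows right and misses every fraud case.
baseline_accuracy = (tn + fp) / total
baseline_recall = 0 / (tp + fn)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"accuracy={accuracy:.3f} baseline_accuracy={baseline_accuracy:.3f}")
```

On these counts the trivial baseline still reaches 99% accuracy while catching no fraud at all, which is why accuracy alone says little on highly imbalanced data.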

Conclusion

The proposed approach offers an effective way to detect credit card fraud by addressing the key problems of data imbalance, overfitting, and interpretability. K-means SMOTEENN is applied to the highly imbalanced, anonymized credit card transaction dataset to improve the class distribution: it generates synthetic samples for the minority class and filters out noisy or duplicate data through Edited Nearest Neighbors. This improves data quality and lowers the risk of overfitting. A stacking ensemble framework is built on this balanced dataset by combining classifiers like XGBoost, Decision Tree, Random Forest, and Logistic Regression. A meta-learner then combines the outputs of these classifiers to create a robust and general decision boundary. The experimental evaluation shows results far better than those of the individual baseline classifiers: the F1-score is 0.92, the precision is 0.95, the recall is 0.88, the AUPRC is 0.96, and the ROC-AUC is 1.00. In addition, Explainable AI (XAI) methods such as Local Interpretable Model-Agnostic Explanations (LIME) improve transparency by showing which features matter most for predictions and how the base learners and the meta-learner work together. This combination of resampling, ensemble algorithms, and interpretability yields an accurate, dependable, and transparent fraud detection system suited to real-world financial settings.

Future work includes adding DL models such as LSTM and Transformer-based architectures to capture trends in transactions over time. Adding more financial records from different sources, such as cross-border and real-time transactions, can make the dataset more robust. Federated learning could keep model training private across institutions. Finally, extending Explainable AI beyond LIME with techniques like SHAP can make the model easier to understand, which will help financial companies adopt flexible, scalable, and transparent fraud detection solutions in changing conditions.

References

[1] Mienye, I. D., & Jere, N. (2024). Deep learning for credit card fraud detection: A review of algorithms, challenges, and solutions. IEEE Access, 12, 96893–96910.
[2] Ileberi, E., Sun, Y., & Wang, Z. (2021). Performance evaluation of machine learning methods for credit card fraud detection using SMOTE and AdaBoost. IEEE Access, 9, 165286–165294.
[3] Alfaiz, N. S., & Fati, S. M. (2022). Enhanced credit card fraud detection model using machine learning. Electronics, 11(4), 662.
[4] Gupta, P., Varshney, A., Khan, M. R., Ahmed, R., Shuaib, M., & Alam, S. (2023). Unbalanced credit card fraud detection data: A machine learning-oriented comparative study of balancing techniques. Procedia Computer Science, 218, 2575–2584.
[5] Mienye, I. D., & Sun, Y. (2023). A deep learning ensemble with data resampling for credit card fraud detection. IEEE Access, 11, 30628–30638.
[6] Singh, A., Ranjan, R. K., & Tiwari, A. (2022). Credit card fraud detection under extreme imbalanced data: A comparative study of data-level algorithms. Journal of Experimental & Theoretical Artificial Intelligence, 34(4), 571–598.
[7] Khan, A. A., Chaudhari, O., & Chandra, R. (2023). A review of ensemble learning and data augmentation models for class imbalanced problems: Combination, implementation, and evaluation. Expert Systems with Applications, 244, 122778.
[8] Ni, L., Li, J., Xu, H., Wang, X., & Zhang, J. (2023). Fraud feature boosting mechanism and spiral oversampling balancing technique for credit card fraud detection. IEEE Transactions on Computational Social Systems, 11(2), 1615–1630.
[9] Marazqah Btoush, E. A. L., Zhou, X., Gururajan, R., Chan, K. C., Genrich, R., & Sankaran, P. (2023). A systematic review of literature on credit card cyber fraud detection using machine and deep learning. PeerJ Computer Science, 9, e1278.
[10] Mim, M. A., Majadi, N., & Mazumder, P. (2024). A soft voting ensemble learning approach for credit card fraud detection. Heliyon, 10(3), e25466.
[11] Aung, M. H., Seluka, P. T., Fuata, J. T. R., Tikoisuva, M. J., Cabealawa, M. S., & Nand, R. (2020). Random forest classifier for detecting credit card fraud based on performance metrics. In Proceedings of IEEE Asia–Pacific Conference on Computer Science and Data Engineering (CSDE) (pp. 1–6).
[12] Aburbeian, A. M., & Ashqar, H. I. (2023). Credit card fraud detection using enhanced random forest classifier for imbalanced data. In Proceedings of International Conference on Advanced Computing Research (pp. 605–616).
[13] Zhu, H., Zhou, M., Liu, G., Xie, Y., Liu, S., & Guo, C. (2023). NUS: Noisy sample-removed undersampling scheme for imbalanced classification and application to credit card fraud detection. IEEE Transactions on Computational Social Systems, 11(2), 1793–1804.
[14] Wang, X., Gong, J., Song, Y., & Hu, J. (2023). Adaptively weighted three-way decision oversampling: A cluster imbalanced-ratio based approach. Applied Intelligence, 53(1), 312–335.
[15] Abd El-Naby, A., Hemdan, E. E.-D., & El-Sayed, A. (2023). An efficient fraud detection framework with credit card imbalanced data in financial services. Multimedia Tools and Applications, 82(3), 4139–4160.
[16] Hashemi, S. K., Mirtaheri, S. L., & Greco, S. (2022). Fraud detection in banking data by machine learning techniques. IEEE Access, 11, 3034–3043.
[17] Alamri, M., & Ykhlef, M. (2024). Hybrid undersampling and oversampling for handling imbalanced credit card data. IEEE Access, 12, 14050–14060.
[18] Lopez-Rojas, E. A., Elmir, A., & Axelsson, S. (2016). PaySim: A financial mobile money simulator for fraud detection. In Proceedings of the 28th European Modeling & Simulation Symposium (EMSS) (pp. 249–255). Larnaca, Cyprus.
[19] García, S., Luengo, J., & Herrera, F. (2015). Data preprocessing in data mining (Vol. 72). Cham, Switzerland: Springer.
[20] Chatfield, C. (1986). Exploratory data analysis. European Journal of Operational Research, 23(1), 5–13.

Copyright

Copyright © 2025 Vyshnavi M, Chaithra M, Dhanya Shetty, Nireeksha Pai B N. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Paper Id : IJRASET74246

Publish Date : 2025-09-15

ISSN : 2321-9653

Publisher Name : IJRASET
